Abstract: Kernel-based mean shift (MS) trackers have proven
to be a promising alternative to stochastic particle filtering
trackers. Despite their popularity, MS trackers have two fundamental
drawbacks: (1) the template model can only be built from a
single image; (2) it is difficult to adaptively update the template
model. In this work we generalize the plain MS tracker and
attempt to overcome these two limitations.

It is well known that modeling and maintaining a repre- 
sentation of a target object is an important component of a 
successful visual tracker. However, little work has been done on 
building a robust template model for kernel-based MS tracking. 
In contrast to building a template from a single frame, we 
train a robust object representation model from a large amount 
of data. Tracking is viewed as a binary classification problem, 
and a discriminative classification rule is learned to distinguish 
between the object and background. We adopt a support vector 
machine (SVM) for training. The tracker is then implemented 
by maximizing the classification score. An iterative optimization 
scheme very similar to MS is derived for this purpose. Compared 
with the plain MS tracker, it is now much easier to incorporate 
on-line template adaptation to cope with inherent changes during 
the course of tracking. To this end, a sophisticated on-line support 
vector machine is used. We demonstrate successful localization 
and tracking on various data sets. 

Index Terms — Kernel-based tracking, mean shift, particle filter, 
support vector machine, global mode seeking. 



I. Introduction 

Visual localization/tracking plays a central role for many 
applications like intelligent video surveillance, smart trans- 
portation monitoring systems etc. Localization and tracking 
algorithms aim to find the most similar region to the target in 
an image. Recently, kernel-based tracking algorithms [1], [2],
[3] have attracted much attention as an alternative to particle
filtering trackers [4], [5], [6]. One of the most crucial
difficulties in robust tracking is the construction of representation
models (likelihood models in Bayesian filtering trackers) that
can accommodate illumination variations, deformable appearance
changes, partial occlusions, etc. Most current tracking
algorithms use a single static template image to construct a 
target representation based on density models. For both kernel- 
based trackers and particle filtering trackers, a popular method 
is to exploit color distributions in simple regions (region-wise 
density models). Generally semi-parametric kernel density 
estimation techniques are adopted. However, it is difficult 
to update this target model 111, O, IH, (T), and the target 
representation's fragility usually breaks these trackers over a 
long image sequence. 

Considerable effort has been expended to ease these diffi- 
culties. We believe that the key to finding a solution is to find 
the right representation. In order to accommodate appearance 
changes, the representation model should be learned from as 
many training examples as possible. Fundamentally two meth- 
ods, namely on-line and off-line learning, can be used for the 
training procedure. On-line learning means constantly updating
the representation model during the course of tracking. In [8],
an incremental eigenvector update strategy is proposed to adapt
the target representation model; a linear probabilistic principal
component analysis model is used. The main disadvantage of
the eigen-model is that it is not generic and is usually only
suitable for characterizing texture-rich objects. In [9] a wavelet
model is updated using the expectation maximization (EM)
algorithm. A classification function is progressively learned
using AdaBoost for visual detection and tracking in [10] and
[11] respectively. In [12], pixel-wise Gaussian mixture
models (GMMs) are adopted to represent the target model and are
sequentially updated. To date, however, little work has been reported
on how to elegantly update region-wise density models in
tracking.

In contrast, classification^1 is a powerful bottom-up procedure:
it is trained off-line and works on-line. Because the
training is typically built on very large amounts of training
data, its performance is fairly promising even without on-line
updating of the classifier/detector. Inspired by image
classification tasks with color density features and real-time
detection, we learn off-line a density representation model
from multiple training data. By considering tracking as a
binary classification problem, a discriminative classification
rule is learned to distinguish between the tracked object and
background patterns. In this way a robust object representation
model is obtained. This proposal provides a basis for considering
the design of enhanced kernel-based trackers using robust
kernel object representations. A by-product of the training is
the classification function, with which the tracking problem is
cast into a binary classification problem. An object detector
directly using the classification function is then available.
Combining a detector into the tracker makes the tracker more
robust and provides the capabilities of automatic initialization
and recovery from momentary tracking failures.

^1 Object detection is typically a classification problem.

In theory, many classifiers can be used to achieve our goal.
In this paper we show that the popular kernel-based non-linear
support vector machine (SVM) fits well into the kernel-based
tracking framework. Within this framework the traditional kernel
object trackers proposed in [1] and [13] can be expressed as
special cases. Because we use probabilistic density features,
the learning process is closely related to probabilistic kernel
based SVMs [14], [15], [16], [17]. It is imperative to minimize
computational costs for real-time applications such as tracking.
A desirable property of the proposed algorithm is that the
computational complexity is independent of the number of
support vectors. Furthermore we empirically demonstrate that
our algorithm requires fewer iterations to achieve convergence.

Our approach differs from [18] although both use the SVM
classification score as the cost function. In [18], Avidan builds
a tracker along the line of standard optical flow tracking. Only
the homogeneous quadratic polynomial kernel (or kernels with
a similar quadratic structure) can be used in order to derive
a closed-form solution. This restriction prevents one from using
a more appropriate kernel obtained by model selection. An
advantage of [18] is that it can be used consistently with
optical flow tracking, although only gray pixel information can be
used. Moreover, the optimization procedure of our approach
is inspired by the kernel-based object tracking paradigm [1].
Hence extended work such as [2] is also applicable here, which
enables us to find the global optimum. If joint spatial-feature
density is used to train an SVM, a fixed-point optimization
method may also be derived that is similar to [13]. The
classification function of the SVM trained for vehicle recognition is
not smooth w.r.t. spatial mis-registration (see Fig. 1 in [19]).
We employ a spatial kernel to smooth the cost function when
computing the histogram feature. In this way, gradient-based
optimization methods can be used. Using statistical learning
theory, we devise an object tracker that is consistent with
MS tracking. The MS tracker was originally derived from kernel
density estimation (KDE). Our work sheds some light on the
connection between SVM and KDE^2.

Another important part of our tracker is its on-line re-training
in parallel with tracking. Continuous updating of
the representation model can capture changes of the target
appearance and background. Previous work such as [9], [10], [8],
[12] has demonstrated the importance of this on-line update
during the course of tracking. The incremental SVM technique
[22], [23], [24], [25] serves this purpose: it efficiently updates
a trained SVM function whenever a sample is added to or
removed from the training set. For our proposed tracking
framework, the target model can be learned by either batch
SVM training or on-line SVM learning. We adopt the
on-line SVM learning method proposed in [24] for its efficiency
and simplicity. We address the crucial problem of adaptation,
i.e., the on-line learning of a discriminant appearance model
while avoiding drift.

^2 It is believed that statistical learning theory (SVM and many other kernel
learning methods) can be interpreted in the framework of information theoretic
learning [20], [21].



The main contribution of our work is to address the two
drawbacks of MS trackers: the template model can only be built
from a single image, and it is difficult to update the model. The
solution is to extend the use of statistical learning algorithms
for object localization and tracking. SVM has been used for
tracking by means of spatial perturbation of the SVM [18].
We exploit SVM for tracking in a novel way (along the line
of MS tracking). The key ingredients of our approach are:

• Probabilistic kernel based SVMs are trained and incor- 
porated into the framework of MS tracking. By carefully 
selecting the kernel, we show that no extra computation 
is required compared with the conventional single-view 
MS tracking. 

• An on-line SVM can be used to adaptively update the 
target model. We demonstrate the benefit of on-line target 
model update. 

• We show that the annealed MS algorithm proposed in
[2] can be viewed as a special case of the continuation
method under an appropriate interpretation. With the new
interpretation, annealed MS can be extended to more
general cases. Extensions and new discoveries are discussed.
An efficient localizer is built with global mode seeking
techniques.

• Again, by exploiting the SVM binary classifier, it is
possible to determine the scale of the target. An improved
annealed MS-like algorithm with a cascade architecture
is developed. It enables a more systematic and easier
design of the annealing schedule, in contrast with the ad hoc
methods in previous work [2].

The remainder of the paper is organized as follows. In
§II the general theory of MS tracking and SVM is reviewed
for completeness. Our proposed tracker is presented in §III.
Finally, experimental results are reported in §IV. We conclude
this work in §V.

II. PRELIMINARIES 

For self-completeness, we review mean shift tracking, sup- 
port vector machine and its on-line learning version in this 
section. 

A. Mean Shift Tracking 

Mean shift (MS) tracking was first presented in [1]. In
MS tracking, the object is represented by a square region
which is cropped and normalized into a unit circle. By
denoting $\mathbf{q}$ as the color histogram of the target model, and
$\mathbf{p}(\mathbf{c})$ as the target candidate color histogram with the center
at $\mathbf{c}$, the distance between $\mathbf{q}$ and $\mathbf{p}(\mathbf{c})$ is (when the
Bhattacharyya divergence [1] is used)

$$\mathrm{dist}(\mathbf{q}, \mathbf{p}(\mathbf{c})) = \sqrt{1 - \varrho(\mathbf{q}, \mathbf{p})}.$$

Here $\varrho(\mathbf{q}, \mathbf{p}) = \sqrt{\mathbf{q}}^{T} \sqrt{\mathbf{p}}$ is the similarity measurement
(the Bhattacharyya coefficient). Let
$\{\mathbf{I}_\ell\}_{\ell=1}^{n}$ be a region's pixel positions in image $I$ with the center
at $\mathbf{c}$. In order to make the cost function smooth (otherwise
gradient-based MS optimization cannot be applied), a kernel
with profile $k(\cdot)$ is employed to assign smaller weights to those
pixels farther from the center, considering the fact that the
peripheral pixels are less reliable. An $m$-bin color histogram
is built for an image patch located at $\mathbf{c}$, $\mathbf{q}(\mathbf{c}) = \{q_u(\mathbf{c})\}_{u=1,\ldots,m}$,
where

$$q_u(\mathbf{c}) = \lambda \sum_{\ell=1}^{n} k\!\left(\left\|\frac{\mathbf{c} - \mathbf{I}_\ell}{h}\right\|^2\right) \delta(\vartheta(\mathbf{I}_\ell) - u). \qquad (1)$$

Here $k(\cdot)$ is the homogeneous spatial weighting kernel profile
and $h$ is its bandwidth. $\delta(\cdot)$ is the delta function and $\lambda$
normalizes $\mathbf{q}$. The function $\vartheta(\mathbf{I}_\ell)$ maps a feature of $\mathbf{I}_\ell$ into
a histogram bin $u$. $\mathbf{c}$ is the kernel center; for the target
model usually $\mathbf{c} = 0$. The representation of the candidate $\mathbf{p}$ takes
the same form.
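As an illustration of Equation (1), the following minimal Python sketch builds a kernel-weighted histogram. It assumes the Epanechnikov profile $k(r) = \max(0, 1 - r)$ and a list of pixels whose features have already been quantized into bins; the names `epanechnikov_profile` and `kernel_histogram` are hypothetical and are not taken from the original work.

```python
def epanechnikov_profile(r):
    """Kernel profile k(r), with r the squared normalized distance."""
    return max(0.0, 1.0 - r)


def kernel_histogram(pixels, center, h, m, profile=epanechnikov_profile):
    """Kernel-weighted m-bin histogram of a patch centered at `center`.

    pixels: list of ((x, y), u), where u in 0..m-1 is the histogram bin of
            the pixel feature (quantization is assumed done beforehand).
    h:      spatial bandwidth of the kernel.
    """
    hist = [0.0] * m
    cx, cy = center
    for (x, y), u in pixels:
        r = ((cx - x) ** 2 + (cy - y) ** 2) / (h * h)
        hist[u] += profile(r)  # pixels far from the center count less
    total = sum(hist)
    # Dividing by the total plays the role of the normalizer in Equation (1).
    return [v / total for v in hist] if total > 0 else hist


# Toy patch: three pixels, two bins.
pixels = [((0, 0), 0), ((1, 0), 1), ((0, 2), 1)]
q = kernel_histogram(pixels, center=(0, 0), h=2.0, m=2)
```

In this toy patch the pixel at distance 2 lies on the kernel boundary and receives zero weight, so only the two inner pixels contribute to the histogram.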

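Given an initial position $\mathbf{c}_0$, the problem of localization/tracking
is to estimate a best displacement $\Delta\mathbf{c}$ such that
the measurement $\mathbf{p}(\mathbf{c}_0 + \Delta\mathbf{c})$ at the new location best matches
the target $\mathbf{q}$, i.e.,

$$\Delta\mathbf{c}^{\star} = \arg\min_{\Delta\mathbf{c}} \mathrm{dist}(\mathbf{q}, \mathbf{p}(\mathbf{c}_0 + \Delta\mathbf{c})).$$

By Taylor expanding $\mathrm{dist}(\mathbf{q}, \mathbf{p}(\mathbf{c}))$ at the start position $\mathbf{c}_0$
and keeping only the linear term (first-order Taylor approximation),
the above optimization problem can be solved by
an iterative procedure:

$$\mathbf{c}^{[\tau+1]} = \frac{\sum_{\ell=1}^{n} \mathbf{I}_\ell\, w_\ell\, g\!\left(\left\|\frac{\mathbf{c}^{[\tau]} - \mathbf{I}_\ell}{h}\right\|^2\right)}{\sum_{\ell=1}^{n} w_\ell\, g\!\left(\left\|\frac{\mathbf{c}^{[\tau]} - \mathbf{I}_\ell}{h}\right\|^2\right)}, \qquad (2)$$

where $g(\cdot) = -k'(\cdot)$ and the superscript $\tau = 0, 1, 2, \ldots$
indexes the iteration step. The weights are calculated as
$w_\ell = \sum_{u=1}^{m} \sqrt{q_u / p_u(\mathbf{c}_0)}\; \delta(\vartheta(\mathbf{I}_\ell) - u)$. See [1] for details.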
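One update of the iteration in Equation (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Epanechnikov profile, for which $g(\cdot)$ is constant inside the kernel support, and the function names `ms_weights` and `mean_shift_step` are hypothetical.

```python
import math


def ms_weights(pixels, q, p0):
    """Per-pixel weights sqrt(q_u / p_u(c0)) for the bin u of each pixel."""
    return [math.sqrt(q[u] / p0[u]) if p0[u] > 0 else 0.0 for _, u in pixels]


def mean_shift_step(pixels, weights, center, h):
    """One fixed-point update of the kernel center.

    With the Epanechnikov profile k(r) = max(0, 1 - r), g(r) = -k'(r) is 1
    inside the kernel support and 0 outside, so the update is a weighted
    mean of the pixel positions that fall inside the bandwidth.
    """
    cx, cy = center
    sx = sy = sw = 0.0
    for ((x, y), _), w in zip(pixels, weights):
        r = ((cx - x) ** 2 + (cy - y) ** 2) / (h * h)
        g = 1.0 if r <= 1.0 else 0.0
        sx += x * w * g
        sy += y * w * g
        sw += w * g
    return (sx / sw, sy / sw) if sw > 0 else center


pixels = [((0.0, 0.0), 0), ((2.0, 0.0), 1)]
q = [0.2, 0.8]    # target model (toy values)
p0 = [0.5, 0.5]   # candidate histogram at the start position (toy values)
w = ms_weights(pixels, q, p0)
new_center = mean_shift_step(pixels, w, center=(1.0, 0.0), h=2.0)
```

Here the pixel in bin 1 is under-represented in the candidate relative to the target, so it receives twice the weight of the other pixel and the center moves toward it.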

B. Support Vector Machines 

We limit our explanation of the support vector machine
classification algorithm to an overview.

Large margin classifiers have demonstrated their advantages
in many vision tasks. SVM is one of the popular large margin
classifiers [26] and has a very promising generalization
capacity.

The linear SVM is the best understood and simplest to
apply. However, linear separability is a rather strict condition.
Kernels are combined with margins to relax this restriction.
SVM is extended to deal with linearly non-separable problems
by mapping the training data from the input space into a
high-dimensional, possibly infinite-dimensional, feature space,
i.e., $\Phi(\cdot): \mathcal{X} \rightarrow \mathcal{F}$. Using the kernel trick, the map
$\Phi(\cdot)$ need not be known explicitly. Like other kernel
methods, SVM constructs a symmetric and positive definite
kernel matrix (Gram matrix) which represents the similarities
between all training data points. Given $N$ training
data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, the kernel matrix is written as
$K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$, $i, j = 1 \cdots N$. When $K_{ij}$ is
large, the labels of $\mathbf{x}_i$ and $\mathbf{x}_j$, $y_i$ and $y_j$, are expected to be
the same. Here, $y_i, y_j \in \{+1, -1\}$. The decision rule is given
by $\mathrm{sign}(f(\mathbf{x}))$ with

$$f(\mathbf{x}) = \sum_{i=1}^{N_S} \beta_i K(\mathbf{x}_i, \mathbf{x}) + b, \qquad (3)$$

where $\mathbf{x}_i \in \mathcal{X}$, $i = 1 \cdots N_S$, are the support vectors, $N_S$ is the
number of support vectors, $\beta_i$ is the weight associated with
$\mathbf{x}_i$, and $b$ is the bias.



The training process of SVM then determines the parameters
$\{\mathbf{x}_i, \beta_i, b, N_S\}$ by solving the optimization problem

$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i^{r} \\ \text{subject to} \quad & y_i(\mathbf{w}^{T} \Phi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \; \forall i, \\ & \xi_i \geq 0, \; \forall i, \end{aligned} \qquad (4)$$

where $\boldsymbol{\xi} = \{\xi_i\}_{i=1}^{N}$ is the slack variable set and the
regularization parameter $C$ determines the trade-off between
SVM's generalization capability and training error. $r = 1, 2$
corresponds to the 1-norm and 2-norm SVM respectively. The
solution takes the form $\mathbf{w} = \sum_{i=1}^{N} y_i \alpha_i \Phi(\mathbf{x}_i)$. Here, $\alpha_i \geq 0$
and most of them are 0, yielding sparseness. The optimization
(4) can be efficiently solved by linear programming (1-norm
SVM) or quadratic programming (2-norm SVM) in its dual.
Refer to [26] for details.



C. On-line Learning with Kernels 

A simple on-line kernel-based algorithm, termed NORMA,
has been proposed for a variety of standard machine learning
tasks in [24]. The algorithm is computationally cheap at each
update step. We have implemented NORMA here for on-line
SVM learning. See Fig. 1 in [24] for the backbone of the
algorithm. We omit the details due to space constraints.

As mentioned, visual tracking is naturally a time-varying
problem. An on-line learning method allows updating the
model during the course of tracking.
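To make the idea of an on-line kernel update concrete, the sketch below follows the general spirit of NORMA with a soft-margin loss: existing coefficients are shrunk at every step, and a new kernel term is added only when the current sample violates the margin. It is an illustrative assumption rather than the implementation used in this work; the class name `NormaClassifier`, the parameter values and the use of the Bhattacharyya kernel on toy histograms are all hypothetical, and refinements such as truncating old terms are omitted.

```python
import math


def bhattacharyya_kernel(q, p):
    """Discrete Bhattacharyya kernel between two histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(q, p))


class NormaClassifier:
    """Minimal on-line kernel classifier (NORMA-style sketch)."""

    def __init__(self, kernel, eta=0.5, lam=0.01, margin=1.0):
        self.kernel, self.eta, self.lam, self.margin = kernel, eta, lam, margin
        self.centers, self.alphas, self.bias = [], [], 0.0

    def score(self, x):
        return sum(a * self.kernel(c, x)
                   for a, c in zip(self.alphas, self.centers)) + self.bias

    def update(self, x, y):
        f = self.score(x)
        # Regularization: shrink all existing coefficients.
        decay = 1.0 - self.eta * self.lam
        self.alphas = [a * decay for a in self.alphas]
        # Margin violation: add the sample as a new kernel term.
        if y * f <= self.margin:
            self.centers.append(x)
            self.alphas.append(self.eta * y)
            self.bias += self.eta * y


target, background = [0.8, 0.2], [0.2, 0.8]   # toy color histograms
clf = NormaClassifier(bhattacharyya_kernel)
for _ in range(2):
    clf.update(target, +1)
    clf.update(background, -1)
```

Each update costs one pass over the stored terms, which is what makes this kind of scheme attractive for re-training in parallel with tracking.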

III. Generalized Kernel-based Tracking

The standard kernel-based MS tracker is generalized by 
maximizing a sophisticated cost function defined by SVM. 



A. Probability Product Kernels 

Measuring the similarity between images and image patches
is of central importance in computer vision. In SVMs, the
kernel $K(\cdot, \cdot)$ plays this role. Most commonly used kernels
such as Gaussian and polynomial kernels are not defined on the
space of probability distributions. Recently various probabilistic
kernels have been introduced, including the Fisher kernel
[14], TOP [15], Kullback-Leibler kernel [16] and probability
product kernels (PPK) [17], to combine generative models
into discriminative classifiers. A probabilistic kernel is defined
by first fitting a probabilistic model $p(\mathbf{x}_i)$ to each training
vector $\mathbf{x}_i$. The kernel is then a measure of similarity between
probability distributions. PPK is an example [17], with kernel
given by

$$K_\rho(q(\mathbf{x}), p(\mathbf{x})) = \int q(\mathbf{x})^{\rho}\, p(\mathbf{x})^{\rho}\, d\mathbf{x}, \qquad (5)$$

where $\rho$ is a constant. When $\rho = \frac{1}{2}$, PPK reduces to a special
case, termed the Bhattacharyya kernel:

$$K_{1/2}(q(\mathbf{x}), p(\mathbf{x})) = \int \sqrt{q(\mathbf{x})}\, \sqrt{p(\mathbf{x})}\, d\mathbf{x}. \qquad (6)$$

In the case of discrete histograms, i.e., $q(\mathbf{x}) = [q_1 \cdots q_m]^{T}$
and $p(\mathbf{x}) = [p_1 \cdots p_m]^{T}$, (6) becomes

$$K_{1/2}(q(\mathbf{x}), p(\mathbf{x})) = \sqrt{q(\mathbf{x})}^{T} \sqrt{p(\mathbf{x})} = \sum_{u=1}^{m} \sqrt{q_u\, p_u}. \qquad (7)$$



When $\rho = 1$, $K_1(\cdot, \cdot)$ computes the expectation of one
distribution over the other, and hence is termed the expected
likelihood kernel [17]. In [27] its corresponding statistical
affinity is used as a similarity measurement for tracking. The
Bhattacharyya kernel is adopted in this work because:

• The standard MS tracker [1] uses the Bhattacharyya
distance. It is clearer to show the connection between the
proposed tracker and the standard MS tracker by using the
Bhattacharyya kernel.

• It has been empirically shown, at least for image classification,
that the generalization capability of the expected likelihood
kernel $K_1(\cdot, \cdot)$ is weaker than that of the Bhattacharyya
kernel. Meanwhile, non-linear probabilistic kernels including
the Bhattacharyya kernel, Kullback-Leibler kernel,
Renyi kernel etc. perform similarly [28]. Moreover, the
Bhattacharyya kernel is simple and has no kernel parameter
to tune.

The PPK has an interesting characteristic that the mapping
function $\Phi(\cdot)$ is explicitly known: $\Phi(q(\mathbf{x})) = q(\mathbf{x})^{\rho}$. This
is equivalent to directly setting $\mathbf{x} = q(\mathbf{x})^{\rho}$ and the kernel
$K_\rho(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T} \mathbf{x}_j$. Consequently for discrete PPK based
SVMs, in the test phase the computational complexity is
independent of the number of support vectors. This is easily
verified. The decision function is

$$f(\mathbf{x}) = \sum_{i=1}^{N_S} \beta_i \left[q(\mathbf{x}_i)^{\rho}\right]^{T} p(\mathbf{x})^{\rho} + b = \left[\sum_{i=1}^{N_S} \beta_i\, q(\mathbf{x}_i)^{\rho}\right]^{T} p(\mathbf{x})^{\rho} + b.$$

The first term in the bracket can be calculated beforehand. For
example, for histogram-based image classification like [29],
given a test image $\mathbf{x}$, the histogram vector $p(\mathbf{x})$ is immediately
available. In fact we can interpret discrete PPK based SVMs
as linear SVMs in which the input vectors are $q(\mathbf{x}_i)^{\rho}$, i.e., the
features non-linearly^3 extracted from image densities. Again,
one might argue that, since an SVM with the Bhattacharyya kernel is very
similar to the linear SVM, it might not have the same power in
modelling complex classification boundaries as the traditional
non-linear kernels like the Gaussian or polynomial kernel.
The experiments in [28] indicate that the classification performance
of a probabilistic kernel which involves an exponential
calculation is not clearly better: exponential kernels like the
Kullback-Leibler kernel and Renyi kernel perform similarly
to the Bhattacharyya kernel on various datasets for image
classification. Moreover our main purpose is to learn a representation
model for visual tracking. Unlike other image classification
tasks, in which high generalization accuracy is demanded,
for visual tracking achieving very high accuracy might not be
necessary and may not translate to a significant increase in
tracking performance.

^3 When $\rho = 1$, it is linear. The non-linear probabilistic kernels induce
a transformed feature space (as the Bhattacharyya kernel does) to smooth
density such that they significantly improve classification over the linear kernel
[28].
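The independence from the number of support vectors can be checked numerically. The sketch below, written for the discrete Bhattacharyya kernel ($\rho = \frac{1}{2}$), evaluates the decision function once with one kernel evaluation per support vector and once with the bracketed term precomputed. The histograms, weights and function names are made-up illustrations, not values from the experiments.

```python
import math


def bhattacharyya_kernel(q, p):
    return sum(math.sqrt(a * b) for a, b in zip(q, p))


def score_per_support_vector(svs, betas, b, p):
    """Direct evaluation: one kernel evaluation per support vector."""
    return sum(beta * bhattacharyya_kernel(q, p)
               for beta, q in zip(betas, svs)) + b


def collapse_support_vectors(svs, betas):
    """Precompute the bracketed term: v_u = sum_i beta_i * sqrt(q_{i,u})."""
    m = len(svs[0])
    return [sum(beta * math.sqrt(q[u]) for beta, q in zip(betas, svs))
            for u in range(m)]


def score_collapsed(v, b, p):
    """Evaluation with a single m-dimensional dot product."""
    return sum(vu * math.sqrt(pu) for vu, pu in zip(v, p)) + b


svs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]]  # toy histograms
betas = [0.9, -0.6, 0.2]                                    # toy signed weights
b = -0.1
p = [0.5, 0.25, 0.25]                                       # test histogram
v = collapse_support_vectors(svs, betas)
slow = score_per_support_vector(svs, betas, b, p)
fast = score_collapsed(v, b, p)
```

After `collapse_support_vectors` has been run once off-line, the cost of `score_collapsed` depends only on the number of bins, whatever the number of support vectors.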

Note that PPKs are less compelling when the input data are
vectors with no further structure. However, even the Gaussian
kernel is a special case of PPK ($\rho = 1$ in Equation (5) and
$p(\mathbf{x})$ is a single Gaussian fit to $\mathbf{x}_i$ by maximum likelihood)
[17].

By contrast, the reduced set method is applied in [18] to
reduce the number of support vectors for speeding up the
classification phase. Applications which favour fast computation
in the testing phase, such as large scale image retrieval, might
also benefit from this property of the discrete PPK.

B. Decision Score Maximization 

It is well known that the magnitude of the SVM score
$|f(\mathbf{x})|$ measures the confidence in the prediction. The proposed
tracking is based on the assumption that the local maximum
of the SVM score corresponds to the target location we seek,
starting from an initial guess close to the target.

If the local maximum is positive, the tracker accepts the
candidate. Otherwise an exhaustive search or localization
process will start. The tracked position at time $t$ is the initial
guess for the next frame $t+1$, and so forth. We now show how
the local maximum of the decision score is determined.

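As in [1], a histogram representation of the image region
can be computed as Equation (1).

With Equations (3), (7) and (1), we have^4

$$f(\mathbf{c}) = \sum_{i=1}^{N_S} \beta_i \sum_{u=1}^{m} \sqrt{q_{i,u}\, p_u(\mathbf{c})} + b. \qquad (8)$$

We assume the search for the new target location starts from a
nearby position $\mathbf{c}_0$; a Taylor expansion of the kernel around
$p_u(\mathbf{c}_0)$ is then applied, similar to [1]. After some manipulations
and putting those terms independent of $\mathbf{c}$ together, denoted by
$\Delta$, (8) becomes

$$f(\mathbf{c}) \approx \frac{1}{2} \sum_{i=1}^{N_S} \beta_i \sum_{u=1}^{m} p_u(\mathbf{c}) \sqrt{\frac{q_{i,u}}{p_u(\mathbf{c}_0)}} + \Delta = \frac{\lambda}{2} \sum_{\ell=1}^{n} \hat{w}_\ell\, k\!\left(\left\|\frac{\mathbf{c} - \mathbf{I}_\ell}{h}\right\|^2\right) + \Delta, \qquad (9)$$

where

$$w_{i,\ell} = \sum_{u=1}^{m} \sqrt{\frac{q_{i,u}}{p_u(\mathbf{c}_0)}}\; \delta(\vartheta(\mathbf{I}_\ell) - u) \qquad (10)$$

and

$$\hat{w}_\ell = \sum_{i=1}^{N_S} \beta_i\, w_{i,\ell} = \sum_{u=1}^{m} \frac{\left[\sum_{i=1}^{N_S} \beta_i \sqrt{q_{i,u}}\right]}{\sqrt{p_u(\mathbf{c}_0)}}\; \delta(\vartheta(\mathbf{I}_\ell) - u). \qquad (11)$$

^4 We have used $\mathbf{x}$ to represent the image region. We also use the image
center $\mathbf{c}$ to represent the image region $\mathbf{x}$. For clarity we write $q_{i,u}$ for
the $u$-th histogram bin of the $i$-th support vector.

Here (9) is obtained by swapping the order of summation. The
first term of $f(\mathbf{c})$ is the weighted kernel density estimate with
kernel profile $k(\cdot)$ at $\mathbf{c}$. It is clear now that our cost function
$f(\mathbf{c})$ has the same form as that of the standard MS tracker.

Can we simply set $\nabla_{\mathbf{c}} f(\mathbf{c}) = 0$, which leads to a fixed-point
iteration procedure to maximize $f(\mathbf{c})$ as the standard MS does?
If it works, the optimization would be similar to (2).

Unfortunately, $\nabla_{\mathbf{c}} f(\mathbf{c}) = 0$ cannot guarantee convergence to a
local maximum. That is, the fixed-point iteration can
converge to a local minimum. We know that only when all the
weights are positive does the iteration converge to a local maximum,
as the standard MS does. See Appendix for the theoretical
analysis.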

However, in our case, a negative support vector's weight
$\beta_i$ is negative, which means some of the weights computed
by (11) could be negative. The traditional MS algorithm
requires that the sample weights be non-negative. The issue of
MS with negative weights is discussed in [30], where
a heuristic modification is given to make MS able to deal
with samples with negative weights. According to [30], the
modified MS (Equation (12)) normalizes the update by the sum of
absolute values, where $|\cdot|$ is the absolute value operation. Alas this heuristic
solution is problematic. Note that no theoretical analysis is
given in [30]. We show that the method in [30] cannot guarantee
convergence to a local maximum. See Appendix
for details.
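A one-dimensional toy case illustrates why negative weights break the fixed-point iteration. The sketch below assumes a Gaussian profile and a single sample; the functions `score` and `fixed_point_step` are hypothetical names. With a positive weight the MS-style update moves to the maximum of the score, but with a negative weight the very same update lands on the minimum.

```python
import math


def score(c, samples, weights, h):
    """Weighted kernel sum with a Gaussian profile."""
    return sum(w * math.exp(-((c - x) / h) ** 2 / 2.0)
               for x, w in zip(samples, weights))


def fixed_point_step(c, samples, weights, h):
    """MS-style update c <- sum(x w g) / sum(w g) for the Gaussian profile.

    For the Gaussian profile g is proportional to the kernel itself, and
    the constant factor cancels in the ratio.
    """
    g = [math.exp(-((c - x) / h) ** 2 / 2.0) for x in samples]
    num = sum(x * w * gi for x, w, gi in zip(samples, weights, g))
    den = sum(w * gi for w, gi in zip(weights, g))
    return num / den


samples, h, start = [0.0], 1.0, 1.0
c_pos = fixed_point_step(start, samples, [+1.0], h)   # attracting sample
c_neg = fixed_point_step(start, samples, [-1.0], h)   # repelling sample
```

Both updates move to the sample position, so the score increases in the first case and decreases in the second, which is the failure mode discussed above.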

The above problem may be avoided by using 1-class SVMs
[31] in which $\beta_i$ is strictly positive. However the discriminative
power of SVM is also eliminated due to its unsupervised
nature.

In this work, we use a Quasi-Newton gradient-based
algorithm for maximizing $f(\mathbf{c})$ in (9). In particular, the
L-BFGS algorithm [32] is adopted for implementing the Quasi-Newton
algorithm. We provide callbacks for calculating the
value of the SVM classification function $f(\mathbf{c})$ and its gradient.
Typically, only a few iterations of the optimization procedure
are performed at each frame. It has been shown that Quasi-Newton
can be a better alternative to MS optimization for
visual tracking [33] in terms of accuracy. Quasi-Newton was
also used in [34] for kernel-based template alignment. Besides,
in O the authors have shown that Quasi-Newton converges 
around twice faster than the standard MS does for data 
clustering. 
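The role of the two callbacks (score and gradient) can be illustrated with a one-dimensional toy problem. Note that the sketch below deliberately swaps in plain gradient ascent with step halving instead of L-BFGS, to stay dependency-free; it assumes a Gaussian profile, and the sample positions, weights and function names are hypothetical. Unlike the fixed-point update, this kind of ascent never decreases the score even when some weights are negative.

```python
import math


def score(c, samples, weights, h):
    """Callback 1: value of the (toy) classification score at c."""
    return sum(w * math.exp(-((c - x) / h) ** 2 / 2.0)
               for x, w in zip(samples, weights))


def gradient(c, samples, weights, h):
    """Callback 2: derivative of the score with respect to c."""
    return sum(w * math.exp(-((c - x) / h) ** 2 / 2.0) * (x - c) / (h * h)
               for x, w in zip(samples, weights))


def ascend(c, samples, weights, h, iters=50):
    """Gradient ascent with step halving; a step is accepted only if the
    score strictly increases."""
    history = [score(c, samples, weights, h)]
    for _ in range(iters):
        g = gradient(c, samples, weights, h)
        step = 1.0
        while step > 1e-10 and score(c + step * g, samples, weights, h) <= history[-1]:
            step *= 0.5
        if step <= 1e-10:
            break  # no improving step found: treat as converged
        c += step * g
        history.append(score(c, samples, weights, h))
    return c, history


samples = [0.0, 2.0]     # one attracting and one repelling sample
weights = [1.0, -0.5]
c_final, history = ascend(1.0, samples, weights, h=1.0)
```

The final position settles slightly to the left of the positive sample, because the negatively weighted sample at 2.0 repels the candidate.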

The essence behind the proposed SVM score maximization
strategy is intuitive. The cost function $f(\mathbf{c})$ favors both the
dissimilarity to negative training data (e.g., background) and
the similarity to positive training data. Compared to the
standard MS tracking, our strategy provides the capability
to utilize a large amount of training data. The terms with
positive $\beta$ in the cost function attract the
target candidate while the negative terms repel the candidate.
In [35], [36] Zhao et al. have extended MS tracking by
introducing a background term to the cost function, i.e.,
$f(\mathbf{c}) = \lambda_f K_{1/2}(\mathbf{q}, \mathbf{p}(\mathbf{c})) - \lambda_b K_{1/2}(\mathbf{b}(\mathbf{c}), \mathbf{p}(\mathbf{c}))$. $\mathbf{b}(\cdot)$ is the
background color histogram in the corresponding region. It
also linearly combines both positive and negative terms into
tracking and better performance has been observed. It is simple
and no training procedure is needed. Nevertheless it lacks
an elegant means to exploit available training data and the
weighting parameters $\lambda_f$ and $\lambda_b$ need to be tuned manually^5.
The original MS tracker's analysis relies on kernel prop- 
erties We argue that the main purpose of the kernel 
weighting scheme is to smooth the cost function such that 
iterative methods are applicable. Kernel properties then derive 
an efficient MS optimization. As observed by many other 
authors 1 331 . 1371 . the kernels used as weighting kernel density 
estimation 1381 . 1391 . We can simply treat the feature distribu- 
tion as a weighted histogram to smooth the cost function and, 
at the same time, to account for the non-rigidity of tracked 
targets. 

Note that (1) the optimization reduces to the standard MS
tracking if $N_S = 1$; (2) other probability kernels like $K_1(\cdot, \cdot)$
are also applicable here. The only difference is that the weights in
(10) will take other forms.

We have shown earlier that in the testing phase
the support vectors of the discrete PPK do not introduce extra computation.
Again, for our tracking strategy, no computation overhead
is introduced compared with the traditional MS tracking in
[1]. This can be seen from Equation (11): the summation in
(11) (the bracketed term) can be computed off-line. The only
extra computation resides in the training phase; the proposed
tracking algorithm has the same computational complexity as
the standard MS tracker. It is also straightforward to extend
this tracking framework to the spatial-feature space [13], which
has proved more robust.

C. Global Optimum Seeking 

A technique dubbed annealed mean
shift (AnnealedMS) is proposed in [2] to reliably find the global density
mode. AnnealedMS is motivated by the observation that the
number of modes of a kernel density estimator with a Gaussian
kernel is monotonically non-increasing w.r.t. the bandwidth of
the kernel.

Here we re-interpret this global optimization and show that
it is essentially a special case of the continuation approach
[40]. With the new interpretation, it is clear that this
technique is applicable to a broader class of cost functions,
not necessarily density functions.

The continuation method is one of the unconstrained global
optimization techniques and shares similarities with
deterministic annealing. A series of gradually deformed but
smoothed cost functions are successively optimized, where the
solution obtained in the previous step serves as the initial point
in the current step. This way the convergence information is
conveyed. With sufficient smoothing, the first cost function
will be concave/convex such that the global optimum can be
found. The algorithm iterates until it traces the solution back to
the original cost function. We now recall some basic concepts
of the continuation method.

^5 Zhao et al. [35], [36] did not correctly treat MS iteration with negative
weights either, because they used Collins' modified MS (Equation (12)).

Fig. 1. Examples of the initial position (dashed line) and the final convergence position (solid line). Square dots show the optimization convergence
trajectory. The image size in all tests is 320 x 211. The object size is 60 x 50 for the first example and 35 x 25 for the other two. The bars under every test
image indicate the SVM score at each gradient-ascent iteration. The SVM score change is: (left) initial: -1.41, final: 2.77; (middle) initial: -0.86, final:
1.41; (right) initial: -1.04, final: 1.62.



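Definition 1 ([40]). Given a non-linear function $f$, the transformation
$\langle f \rangle_h$ for $f$ is defined as

$$\langle f \rangle_h(\mathbf{x}) = C_h \int f(\mathbf{x}')\, k\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}'}{h}\right\|^2\right) d\mathbf{x}', \qquad (13)$$

where $k(\cdot)$ is a smoothing function; usually the Gaussian
is used. $h$ is a positive scalar which controls the degree
of smoothing. $C_h$ is a normalization constant such that
$C_h \int k\!\left(\left\|\mathbf{x}/h\right\|^2\right) d\mathbf{x} = 1$.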



Note the similarity between the smoothing function $k(\cdot)$ and
the definition of the kernel in KDE. From (13), the defined
transformation is actually the convolution of the cost function
with $k(\cdot)$. In the frequency domain, the frequency response of
$\langle f \rangle_h$ equals the product of the frequency responses of $f$ and
$k$. Being a smoothing filter, $k(\cdot)$ removes the
high frequency components of the original function. Therefore
one of the requirements for $k(\cdot)$ is that its frequency response
must be that of a low-pass filter. We know that popular
kernels like the Gaussian or Epanechnikov kernel are low-pass
filters. This is one of the principal justifications for
using the Gaussian or Epanechnikov kernel to smooth a function. When
$h$ is increased, $\langle f \rangle_h$ becomes smoother, and as $h \to 0$ the
function approaches the original function.
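The effect of optimizing a sequence of progressively less smoothed functions can be seen on a one-dimensional toy data set. The sketch below assumes a Gaussian kernel, unit sample weights and an arbitrary bandwidth schedule; the function names and the data are hypothetical. Started inside a small cluster, plain mean shift with a small bandwidth stays at the local mode, whereas the annealed version, which warm-starts each stage with the previous solution, reaches the dominant mode.

```python
import math


def mean_shift_mode(x, samples, h, iters=200):
    """Mean shift with a Gaussian kernel and unit (positive) weights."""
    for _ in range(iters):
        g = [math.exp(-((x - s) / h) ** 2 / 2.0) for s in samples]
        x = sum(s * gi for s, gi in zip(samples, g)) / sum(g)
    return x


def annealed_mode(x, samples, bandwidths):
    """Continuation: optimize the most smoothed function first, then reuse
    each solution as the starting point for the next, less smoothed, one."""
    for h in bandwidths:
        x = mean_shift_mode(x, samples, h)
    return x


# A small cluster near 0 and a larger one near 5.
samples = [0.0, 0.1, 4.8, 4.9, 5.0, 5.1, 5.2]
start = 0.05
plain = mean_shift_mode(start, samples, h=0.3)
annealed = annealed_mode(start, samples, bandwidths=[4.0, 2.0, 1.0, 0.5, 0.3])
```

With the largest bandwidth the smoothed function has a single mode, so the first stage already pulls the estimate toward the larger cluster; the later stages only refine it.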

Theorem 1. The annealed version of mean shift introduced in
[2] for global mode seeking is a special case of the general
continuation method defined in Equation (13).

Proof: Let the original function $f(\mathbf{x}')$ take the form of a
Dirac delta comb (a.k.a. impulse train in signal processing),
i.e., $f(\mathbf{x}') = \sum_i \delta(\mathbf{x}' - \mathbf{x}_i)$, where $\mathbf{x}_i$ is known. With the
fundamental property that $\int F(\mathbf{x})\, \delta(\mathbf{x} - \mathbf{x}_i)\, d\mathbf{x} = F(\mathbf{x}_i)$ for any
function $F(\cdot)$, we have

$$\langle f \rangle_h(\mathbf{x}) = C_h \sum_i k\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right).$$

This is exactly the same as a KDE. This shows that
AnnealedMS is a special case of the continuation method.
When $f(\mathbf{x}') = \sum_i w_i\, \delta(\mathbf{x}' - \mathbf{x}_i)$ with $w_i \in \mathbb{R}$ ($w_i$ can
be negative), the above analysis still holds and this case
corresponds to the SVM score maximization in §III-B. ■
It is not a trivial problem to determine the optimal scale of 
the spatial kernel bandwidth, i.e., the size of the target, for 
kernel-based tracking. A line search method is introduced in 




Fig. 2. A close look at the cost function of the first example in Fig.[^ (left) 
SVM score; (right) Bhattacharyya distance of standard mean shift. Note that 
for the standard mean shift, the target model is extracted from the same test 
image; while for SVM, the target model is learned from a large number of 
training images that do not contain the test image. 



[30]. For AnnealedMS, an important open issue is how to
design the annealing schedule. Armed with an SVM classifier,
it is possible to determine the object's scale. If only the color
feature is used, it is difficult to estimate a fine scale of the
target, because color lacks spatial information and is
insensitive to scale change. By combining other features, better estimates
are expected. As we will see in the experiments, reasonable
results can be obtained with only color. It is natural to combine
AnnealedMS into a cascade structure, like the cascade
detector of [41]. We start the MS search from a large bandwidth
$h_0$. After convergence, an extra verification is applied to decide
whether to terminate the search. If $\mathrm{sign}(f(I_0)) = -1$, then
$h_0$ is too large; we reduce the bandwidth to $h_1$
and restart MS from the initial location $I_0$. This procedure is
repeated until $\mathrm{sign}(f(I_m)) = +1$, $m \in \{0, \cdots, M\}$. $h_m$ and
$I_m$ are then the final scale and position. Little extra computation
is needed because only a decision verification is introduced at
each stage.
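
The cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `run_mean_shift` and `score` are hypothetical callables standing in for the MS search and the SVM decision function, the shrink factor is left as a free parameter, and the one-dimensional toy target exists only to make the sketch runnable.

```python
def cascade_search(run_mean_shift, score, x0, h0, shrink=1.25, max_stages=10):
    """Coarse-to-fine search over bandwidths h_m = h0 / shrink**m.

    run_mean_shift(x, h) -> converged position (hypothetical callable)
    score(x, h)          -> classifier decision value f (hypothetical callable)
    Returns (position, bandwidth, found).
    """
    x, h = x0, h0
    for m in range(max_stages + 1):
        h = h0 / shrink ** m          # bandwidth for stage m
        x = run_mean_shift(x, h)      # MS search, started from the last position
        if score(x, h) > 0:           # sign(f) = +1: accept scale and position
            return x, h, True
    return x, h, False                # no stage produced a positive decision

# Toy 1-D stand-ins (assumptions for illustration only).
TARGET, SIZE = 5.0, 2.0

def toy_mean_shift(x, h):
    # Pretend MS moves most of the way towards the target.
    return x + 0.9 * (TARGET - x)

def toy_score(x, h):
    # Positive only when the window is on the target and not much larger than it.
    return 1.0 if (abs(x - TARGET) < h and h <= 1.5 * SIZE) else -1.0

position, bandwidth, found = cascade_search(toy_mean_shift, toy_score, x0=0.0, h0=8.0)
print(position, bandwidth, found)
```

Only one extra classifier evaluation is made per stage, which mirrors the remark that the verification adds little computation.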

IV. Experiments 

In this section we implement a localizer and tracker and 
discuss related issues. Experimental results on various data 
sets are shown. 

A. Localization 

For the first experiment, we have trained a face representa- 
tion model. 404 faces cropped from CalTech-101 are used as 
positive raw images, and 1400 negative images are randomly 
Fig. 3. Face localization. The final decision is marked with a rectangle. The image size in all tests is 240 x 180. In the first test (left), the proposed cascade
localizer works very well. For the second one (middle), the detected scale of the target is slightly too large, but acceptable. The SVM scores for the first example
are also plotted (right). The first iteration at each bandwidth is marked with a solid circle.



cropped from images which do not contain faces. The image
size is reduced to 42 x 56 pixels. Kernel-weighted RGB colour
histograms, consisting of 16 x 16 x 16 bins, are extracted
for classification. By default we use a soft-margin SVM trained with
LIBSVM (slightly modified to use customized kernels). The
accuracy on the training data is 99.5% (1795/1804), and
91.7% (2752/3000) on a test data set consisting of
3000 negative samples. Note that our main purpose is not to train a
powerful face detector; rather, we want to obtain an appearance
model that is more robust than the single-view appearance
model. We now test how well the algorithm maximizes the
SVM score. First, we feed the algorithm a rough initial guess
and run MS. See Fig. 1 for details.
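
To make the feature concrete, here is a minimal sketch of a kernel-weighted 16 x 16 x 16 RGB histogram. The Epanechnikov profile, the normalization of pixel coordinates by the patch half-size, and the function name `kernel_weighted_histogram` are assumptions for illustration; the text does not specify which spatial kernel is used.

```python
import numpy as np

def kernel_weighted_histogram(patch, bins_per_channel=16):
    """Kernel-weighted RGB histogram of an (H, W, 3) uint8 patch.

    Pixels are weighted by an assumed Epanechnikov profile
    k(r^2) = max(0, 1 - r^2), where r is the distance to the patch
    center normalized by the patch half-size.
    Returns a vector of length bins_per_channel**3 that sums to 1.
    """
    patch = np.asarray(patch)
    height, width, _ = patch.shape
    ys = (np.arange(height) - (height - 1) / 2.0) / (height / 2.0)
    xs = (np.arange(width) - (width - 1) / 2.0) / (width / 2.0)
    r2 = ys[:, None] ** 2 + xs[None, :] ** 2
    weights = np.clip(1.0 - r2, 0.0, None)      # zero outside the unit ellipse

    step = 256 // bins_per_channel               # intensity levels per bin
    q = patch.astype(np.int64) // step           # per-channel bin index
    index = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]

    hist = np.bincount(index.ravel(), weights=weights.ravel(),
                       minlength=bins_per_channel ** 3)
    return hist / hist.sum()

# Example on a random 42 x 56 patch (width x height), as used for the face model.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(56, 42, 3), dtype=np.uint8)
hist = kernel_weighted_histogram(patch)
print(hist.shape)
```

Pixels near the patch border contribute less than pixels near the center, so the histogram is less sensitive to background leaking into the window.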

The first example in Fig. 1 comes from the training data set.
The initial SVM score is negative. In this case, a single step
is enough to switch to a positive score; the window moves close to
the target after one iteration. We plot the corresponding cost
function in Fig. 2. For comparison, the cost function of the
standard MS is also plotted (the target template is cropped
from the same image). We can clearly see the difference. The
other two test images are from outside the training data set.
Despite the significant face color difference and variation in
illumination, our SVM localizer works well in both tests. To
compare the robustness, we use the first face as a template to
track the second face in Fig. 1; the standard MS tracker fails
to converge to the true position.

We now apply the global maximum seeking algorithm 
to object localization. In 01, it has been shown that it is 
possible to locate a target no matter from which initial position 
the MS tracker starts. Here we use the learned classification
rule to determine when to stop searching. We start the
annealed continuation procedure with the initial bandwidth
$h_0 = (42, 56)$. The bandwidth pyramid then follows the
rule $h_{m+1} = h_m / 1.25$, $m \in \{0, \cdots, M\}$, where $M$ is the maximum
number of iterations. We stop the search when, for some $m$, the
SVM score is positive upon convergence. The image center is
set as the initial position of the search for these two tests. We
present the results in Fig. 3.

In the first test, our proposed algorithm works well: it
successfully finds the face location, and the final bandwidth
fits the target well. Fig. 3 (right) shows how the SVM
score evolves. It can be seen that every bandwidth change
significantly increases the score. If the target size is large and
there is a significant overlap between the target and a search
region at a coarse bandwidth $h_m$, the overlap can make the
cascade search stop prematurely (see the second test in Fig. 3).
Again, this problem is mainly caused by the weak discriminative
power of the color feature. A remedy is to include more
features. However, for certain applications where the scale
is not critically important, our localization results are
usable. Furthermore, better results could be achieved when we
train a model for a specific object (e.g., train an appearance
model for a specific person) with a single color feature.

B. Tracking 

The effectiveness of the proposed generalized kernel-based
tracker is tested on a number of video sequences. We have
compared it with two popular color histogram based methods:
the standard MS tracker [1] and particle filters [4].

Unlike the first experiment, we do not train an off-line SVM
model for tracking. It is not easy to obtain a large amount of
training data for a general object; therefore, in the tracking
experiment, the on-line SVM described in §II-C is used for
training. The user crops several negative and positive samples
for initial training. During the course of tracking, the on-line
SVM updates its model by regarding the tracked region as
a positive example and randomly selecting a few sub-regions
(background area) around the target as negative examples. A
16 x 16 x 16-bin color histogram is used for both the
generalized kernel tracker and the standard MS tracker. For the
particle filter, with 1000 or 800 particles, the tracker fails within
the first few frames, so we have used 1500 particles.
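
The sampling step of the on-line update can be sketched as below. This is only one plausible reading of "randomly selecting a few sub-regions around the target": the search radius of two target sizes, the overlap threshold `max_overlap`, the box format `(x, y, w, h)`, and the function names are all assumptions introduced for this example.

```python
import random

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(aw * ah + bw * bh - inter)

def sample_negative_boxes(target, frame_size, count=4, max_overlap=0.2,
                          rng=None, max_tries=1000):
    """Randomly pick background boxes of the target's size around the target.

    target     : (x, y, w, h) of the tracked region (top-left corner, size)
    frame_size : (frame_width, frame_height)
    Candidates that leave the frame or overlap the target by more than
    max_overlap (IoU) are rejected.
    """
    rng = rng or random.Random()
    x, y, w, h = target
    fw, fh = frame_size
    boxes = []
    for _ in range(max_tries):
        if len(boxes) == count:
            break
        # Candidate offsets within two target sizes of the tracked position.
        nx = x + rng.randint(-2 * w, 2 * w)
        ny = y + rng.randint(-2 * h, 2 * h)
        if nx < 0 or ny < 0 or nx + w > fw or ny + h > fh:
            continue
        if iou((nx, ny, w, h), target) <= max_overlap:
            boxes.append((nx, ny, w, h))
    return boxes

target = (150, 100, 40, 50)                 # tracked region: the positive example
negatives = sample_negative_boxes(target, (320, 240), rng=random.Random(0))
print(negatives)
```

The tracked box and the sampled background boxes would then be converted to histograms and passed to the on-line SVM as one positive and a few negative examples.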

In the first experiment, the tracked person moves quickly.
Hence the displacement between neighboring frames is large.
The illumination also changes. The background scene is cluttered
and contains materials with colors similar to the target.
The proposed algorithm tracks the whole sequence successfully.
Fig. 4 summarizes the tracking results. The standard MS
tracker fails at frame #57, recovers at frame #74, and then fails
again. The particle filter also loses the target due to motion
blur and fast movement. Our on-line adaptive tracker achieves
the most accurate results.

Fig. 5 shows the results on a more challenging video.
The target turns around and at some frames it even moves out
of the view. At frame #194, the target disappears. The generalized
Fig. 4. Face sequence 1. Tracking results of the proposed tracker (top
row), the standard mean shift tracker (middle row), and particle filtering (bottom row).
Frames 26, 56, 318, 432 are shown. The video size is 320 x 240 and the
frame rate is 10 frames per second (FPS).

Fig. 5. Face sequence 2. Tracking results of the proposed tracker (top
row), the standard mean shift tracker (middle row), and particle filtering (bottom row).
Frames 86, 135, 204, 512 are shown. The video size is 320 x 240 and the
frame rate is 10 FPS.

kernel tracker and the particle filter recover in the following
frames, while the MS tracker fails. Again we can see that the
proposed tracker performs best, owing to its learned template model
and on-line adaptivity. When the head turns around, all trackers
can lock onto the target because, compared with the background,
the hair color is more similar to the face color. These two
experiments show that the proposed tracker is more robust to motion
blur, large pose changes, and fast target movement than the
standard MS tracker and the particle filter based tracker. In the
experiments, to initialize the proposed tracker, we randomly
pick a few negative samples from the background. We have
found that this simple treatment works well.

We present more samples from three more sequences in
Figs. 6, 7, and 8. We mark only our tracker in these frames.
From Figs. 6 and 7 we see that despite the target moving into
shadow at some frames, our tracker successfully tracks the
target through the whole sequences.

We have shown promising tracking results of the proposed 
tracker on several video clips. We now present some quanti- 
tative comparisons of our algorithm with other trackers. 

First, we run the proposed tracker, MS, and particle filter
trackers on the cubicle sequence 1. In Fig. 9 we show
some tracking frames of our method and particle filtering.
Compared with particle filtering, our tracker is much more
accurate and much faster. Our
results are also slightly better than those of the standard MS tracker,
but visually there is no significant difference, so we have not
included the MS results in Fig. 9.

Again, the particle filter tracker uses 1500 particles. We have
run the particle filter 5 times and the best result is reported.
Fig. 10 shows the absolute deviation of the tracked object's
center at each frame. Clearly the generalized kernel tracker
demonstrates the best result. We have reported the average
tracking error (the Euclidean distance of the object's center
against the ground truth) in Table I, which shows that the proposed
tracker outperforms MS and the particle filter. In Table I the error
variance estimates are calculated from the tracking results of
all frames, regardless of whether the target is lost. We have also
verified the importance of the on-line SVM update. As mentioned,
when we switch off the on-line update, our proposed tracker
behaves similarly to the standard MS tracker. We see
from Table I that even without updating, the generalized kernel
tracker is slightly better than the standard MS tracker. This
might be because the initialization schemes are different: the
generalized kernel tracker can take multiple positive as well
as negative training examples to learn an appearance model,
while MS can only take a single image for initialization.
Although we use only very few training examples (fewer than
10), it is already better than the standard MS tracker. In this
sequence, when the target object is occluded, the particle filter
tracker tracks only the visible region, so the deviation
becomes large. Our approach updates the learned appearance
model using the on-line SVM. The region that partially contains
the occlusion is gradually added to the object class database
by the on-line update procedure. In this way our tracker
keeps the object position close to the ground truth.

We also report the tracking failure rate (FR) for this video,
which is the percentage of failure frames in the
total number of frames. If the distance between the tracked
center and the ground truth's center is larger than a threshold,
we mark the frame as a failure. We have defined the threshold as 0.20 or
0.25 of the diagonal length of the ground truth's bounding box,
which results in two criteria: FR0.20 and FR0.25, respectively.
The former is stricter than the latter. As shown in Table I,
our tracker with on-line update produces the lowest tracking
failure rate under either criterion.
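
The failure-rate criterion defined above can be written down directly. The following is a minimal sketch; the box format `(x, y, w, h)` with a top-left corner, the function name `failure_rate`, and the four sample frames are assumptions for illustration and are not data from the experiments.

```python
import math

def failure_rate(tracked_centers, gt_boxes, ratio):
    """Percentage of frames whose center error exceeds ratio * GT diagonal.

    tracked_centers : list of (cx, cy), one per frame
    gt_boxes        : list of (x, y, w, h) ground-truth boxes (top-left, size)
    ratio           : 0.20 for FR0.20, 0.25 for FR0.25
    """
    failures = 0
    for (cx, cy), (x, y, w, h) in zip(tracked_centers, gt_boxes):
        gx, gy = x + w / 2.0, y + h / 2.0           # ground-truth center
        error = math.hypot(cx - gx, cy - gy)         # Euclidean center distance
        if error > ratio * math.hypot(w, h):         # threshold on the diagonal
            failures += 1
    return 100.0 * failures / len(gt_boxes)

# Made-up example: a 30 x 40 box has a diagonal of 50 pixels,
# so the thresholds are 10 and 12.5 pixels.
gt = [(0, 0, 30, 40)] * 4
centers = [(15, 20), (15, 31), (15, 33), (26, 20)]   # errors: 0, 11, 13, 11
fr_020 = failure_rate(centers, gt, 0.20)
fr_025 = failure_rate(centers, gt, 0.25)
print(fr_020, fr_025)
```

Because the threshold scales with the ground-truth box, FR0.20 always counts at least as many failures as FR0.25 on the same sequence.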

TABLE I

The average tracking error against the ground truth (pixels) on the cubicle sequence 1. The mean and standard deviation are reported. We also report the tracking failure rates.

            MS           Particle filter   Ours (w/o update)   Ours (update)
error       9.6 ± 5.7    10.5 ± 5.8        8.5 ± 4.9           6.5 ± 2.8
FR0.20      44.0%        44.0%             28.0%               6.0%
FR0.25      16.0%        34.0%             14.0%               0.0%



We also compare the running time of the trackers, which is an
important issue for real-time tracking applications. Table II
reports the results on two sequences.¹ The generalized kernel
tracker (around 65 fps) is comparable to the standard MS

¹ All algorithms are implemented in ANSI C++. We have made the code
available at http://code.google.com/p/detect/. A desktop with an Intel Core™2
Duo 2.4-GHz CPU and 2 GB of RAM is used for running all the experiments.
Fig. 6. Walker sequence 1. Tracking results of the proposed generalized kernel tracker. Frames 20, 40, 60, 90, 115, 130 are shown. The video size is
384 x 288 and the frame rate is 25 FPS.

Fig. 7. Walker sequence 2. Tracking results of the proposed generalized kernel tracker. Frames 10, 55, 80, 105, 140, 183 are shown. The video size and
frame rate are the same as in Fig. 6.

Fig. 8. Walker sequence 3. Tracking results of the proposed generalized kernel tracker. Frames 20, 98, 152, 220, 444, 553 are shown. The video is of size
352 x 288 and frame rate 30 FPS.




TABLE II

Running time per frame (seconds). The stochastic particle filter tracker has run 5 times and the standard deviation is also reported.

Sequence     MS        Particle filter    Ours
cubicle 1    0.0156    0.352 ± 0.025      0.0155
walker 3     0.0169    0.331 ± 0.038      0.0142



Fig. 9. Cubicle sequence 1. Tracking results of the proposed tracker (top) 
and particle filtering (bottom). Frames 16, 30, 41, 45 are shown. The video 
is of size 352 x 288 and 30 FPS. 



tracker, and much faster than the particle filter. This coincides
with the theoretical analysis: the computational complexity of our
generalized kernel tracker is independent of the number of
support vectors, so in the test phase the complexity is
almost the same as that of the standard MS. One may argue that the
on-line update procedure introduces some overhead. However, the
generalized kernel tracker employs the L-BFGS optimization
algorithm, which is about twice as fast as MS, as shown in
[2]. Therefore, overall, the generalized kernel tracker runs as
fast as the MS tracker. Because the particle filter is stochastic,
we have run it 5 times and report the average and standard deviation.
Our tracker and MS are deterministic,
and the standard deviation is negligible. Note that the
computational complexity of the particle filter tracker is linearly
proportional to the number of particles.
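
As a quick check of the frame rates implied by Table II, the per-frame times can be converted to frames per second. The snippet below only restates the mean values from the table; the dictionary layout and variable names are illustrative.

```python
# Mean per-frame running times in seconds, copied from Table II.
seconds_per_frame = {
    "cubicle 1": {"MS": 0.0156, "Particle filter": 0.352, "Ours": 0.0155},
    "walker 3": {"MS": 0.0169, "Particle filter": 0.331, "Ours": 0.0142},
}

def frames_per_second(seconds):
    """Frame rate corresponding to a per-frame running time."""
    return 1.0 / seconds

fps = {seq: {name: frames_per_second(t) for name, t in row.items()}
       for seq, row in seconds_per_frame.items()}

for seq, row in fps.items():
    for name, value in row.items():
        print(f"{seq:10s} {name:16s} {value:6.1f} fps")
```

The value for the proposed tracker on cubicle sequence 1 rounds to the 65 fps quoted in the text, while the particle filter with 1500 particles stays below 3 fps on the same sequence.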

We have run another test on cubicle sequence 2. We show
some results of our method and particle filtering in Fig. 11.
Although all the methods can track this sequence successfully,
the proposed method achieves the most accurate results. We see
that when the tracked object turns around, our algorithm is
still able to track it accurately. Table III summarizes the
quantitative performance. Our method is also slightly better
than MS. Again we see that the on-line update does indeed improve
the accuracy. We have also reported the tracking failure rates
on this video. Our tracker with on-line update has the lowest
tracking failure rate and the one without on-line update is the
second best. These results are consistent with the previous
experiments.

To demonstrate the effectiveness of the on-line SVM learning,
we switch off the on-line update and run the tracker
on the walker sequence 3. We plot the $\ell_1$-norm absolute



TABLE III

The average tracking error against the ground truth (pixels) on the cubicle sequence 2. The mean and standard deviation are reported. We also report the tracking failure rates.

            MS           Particle filter   Ours (w/o update)   Ours (update)
error       5.7 ± 3.5    8.4 ± 3.4         5.5 ± 3.2           4.2 ± 2.8
FR0.20      4.6%         15.4%             3.1%                0.0%
FR0.25      0.0%         3.1%              0.0%                0.0%

Fig. 10. The $\ell_1$-norm absolute error (pixels) of the object's center against
the ground truth on the cubicle sequence 1. The two figures correspond to the
x- and y-axis, respectively. The proposed tracker with on-line updating gives
the best result. As expected, the proposed tracker without updating shows a
performance similar to that of the standard MS tracker.

Fig. 12. The $\ell_1$-norm absolute error (pixels) of the object's center against the
ground truth on the walker sequence 3. The two figures correspond to the x-
and y-axis, respectively. It clearly shows that on-line update of the generalized
kernel tracker is beneficial: without on-line update, the error is larger.




Fig. 11. Cubicle sequence 2. Tracking results of the proposed tracker 
(top) and particle filtering (bottom). Frames 9, 55, 60, 64 are shown. The 
video is of size 352 x 288 and frame rate 30 FPS. 



deviation of the tracked object's center in pixels at each frame
in Fig. 12. At most frames, the on-line update produces
more accurate tracking results. The average Euclidean tracking
error is 8.0 ± 4.9 pixels with on-line update and 12.7 ± 5.8
pixels without on-line update.
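
The two error measures used in the figures and tables (per-axis absolute deviation, and mean and standard deviation of the Euclidean center error) can be computed as in the sketch below. The function name `center_errors` and the use of the population standard deviation are assumptions; the text does not state which estimator was used.

```python
import numpy as np

def center_errors(tracked, ground_truth):
    """Per-frame center errors of a tracker against the ground truth.

    tracked, ground_truth : arrays of shape (n_frames, 2) holding (x, y) centers
    Returns (abs_dev, mean_error, std_error), where abs_dev[:, 0] and
    abs_dev[:, 1] are the absolute deviations along the x- and y-axis and the
    last two values summarize the Euclidean distance over all frames.
    """
    tracked = np.asarray(tracked, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    diff = tracked - ground_truth
    abs_dev = np.abs(diff)                        # per-frame |dx|, |dy|
    euclid = np.sqrt((diff ** 2).sum(axis=1))     # per-frame center distance
    return abs_dev, euclid.mean(), euclid.std()   # population std (assumption)

# Made-up two-frame example: errors of 5 and 0 pixels.
abs_dev, mean_err, std_err = center_errors([(3, 4), (0, 0)], [(0, 0), (0, 0)])
print(abs_dev, mean_err, std_err)
```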

Conclusions that we can draw from these experiments are: 
(1) The proposed generalized kernel-based tracker performs 
better than the standard MS tracker on all the sequences that 
we have used; (2) On-line learning often improves tracking 
accuracy. 



V. Conclusion 

To summarize, we have proposed a novel approach to kernel 
based visual tracking, which performs better than conventional 
single-view kernel trackers Q, |[T3ll . Instead of minimizing 
the density distance between the candidate region and the 
template, the generalized MS tracker works by maximizing
the SVM classification score. Experiments on localization and
tracking show its efficiency and robustness. In this way, we
show the connection between standard MS tracking and SVM-based
tracking. The proposed method provides a framework that
generalizes the previous methods.

Future work will focus on the following possible avenues: 

• Other machine learning approaches, such as relevance
vector machines (RVM) [42], might be employed to learn
the representation model. Since in the test phase RVM
and SVM take the same form, RVM can be used directly
here. RVM achieves recognition accuracy comparable
to the SVM, but requires substantially fewer kernel
functions. It would be interesting to compare the performance
of the different approaches;

• The strategy in this paper can be easily plugged into a
particle filter as an observation model. Better tracking
results are anticipated than with the simple color histogram
particle filter tracker developed in [4].

Appendix 

Generally, Collins' modified mean shift [30] (Equation (12))
cannot be guaranteed to converge to a local maximum. It is obvious
that a fixed point $\mathbf{x}^*$ obtained by iterating Equation (12)
will not satisfy

$\nabla f(\mathbf{x}^*) = 0$,

where $f(\cdot)$ is the original cost function. Therefore, generally, $\mathbf{x}^*$ will
not even be an extreme point of the original cost function. In
the following example, Collins' modified mean
shift converges to a point $\mathbf{x}^*$ that is close to a local minimum,
but not the exact minimum.

[Plot: mixture of Gaussian kernels, marking the initial position, the desired convergence point, the final convergence with the standard mean shift, and the final convergence with the modified mean shift in [Collins, 2003].]


Fig. 13. With negative weights, the modified mean shift proposed in [30]
may not be able to converge to the local maximum. In this case, it converges
to a position close to a local minimum (not the exact minimum). The standard
mean shift converges to the nearest minimum.



In Fig. 13 we give an example of a mixture of Gaussian
kernels that contains some negative weights. In this case both
the standard MS and Collins' modified MS fail to converge to
a maximum.